Abstract —We study the excess mean square error (EMSE)
above the minimum mean square error (MMSE) in large linear 
systems where the posterior mean estimator (PME) is evaluated 
with a postulated prior that differs from the true prior of 
the input signal. We focus on large linear systems where the 
measurements are acquired via an independent and identically 
distributed random matrix, and are corrupted by additive white 
Gaussian noise (AWGN). The relationship between the EMSE in 
large linear systems and EMSE in scalar channels is derived, 
and closed form approximations are provided. Our analysis is 
based on the decoupling principle, which links scalar channels 
to large linear system analyses. Numerical examples demonstrate 
that our closed form approximations are accurate. 

Index Terms —decoupling, large linear systems, mismatched 
estimation. 


I. Introduction 

The posterior mean estimator (PME), also known as conditional
expectation, plays a pivotal role in Bayesian estimation.
To compute the PME, we need a prior distribution for the
unknown signal. In cases where the prior is unavailable, we
may compute the PME with a postulated prior, which may
not match the true prior. Verdú [1] studied the mismatched
estimation problem for scalar additive white Gaussian noise
(AWGN) channels and quantified the excess mean square error
(EMSE) above the minimum mean square error (MMSE) due
to the incorrect prior. A natural extension of Verdú's result
is to quantify the EMSE due to mismatched estimation
in large linear systems [2-7].

Mismatched estimation: Consider scalar estimation,

$$Y = X + \sigma W, \tag{1}$$

where $X$ is generated by some probability density function
(pdf) $p_X$, $W \sim \mathcal{N}(0,1)$ is independent of $X$, and
$\mathcal{N}(\mu,\sigma^2)$ denotes the Gaussian pdf with mean $\mu$
and variance $\sigma^2$. A PME with some prior $q_X$, which is defined as

$$\hat{X}_q(y;\sigma^2) = E_{q_X}[X \mid Y = y], \tag{2}$$

can be applied to the estimation procedure, where in $E_{q_X}[\cdot]$
the expectation is calculated assuming that $X$ is distributed as $q_X$.
The mean square error (MSE) achieved by $\hat{X}_q(y;\sigma^2)$ is

$$\Psi_q(\sigma^2) = E_{p_X}\left[\left(\hat{X}_q(\cdot\,;\sigma^2) - X\right)^2\right]. \tag{3}$$

Note that $\hat{X}_p(y;\sigma^2)$ is the MMSE estimator, and
$\Psi_p(\sigma^2)$ is the MMSE.

Having defined notation for the MSE, we can now define 
the EMSE above the MMSE due to the mismatched prior in 
the scalar estimation problem (1): 


$$\mathrm{EMSE}_s(\sigma^2) = \Psi_q(\sigma^2) - \Psi_p(\sigma^2), \tag{4}$$


where the subscript s represents scalar estimation. 
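As a rough numerical illustration of (3) and (4), the sketch below evaluates the MSE of the mismatched PME and the resulting scalar EMSE for a Bernoulli input ($X \in \{0,1\}$), the prior family used later in Example 1. The function names, the trapezoidal grid, and the truncation of the integral to ten noise standard deviations are assumptions made for this sketch, not part of the analysis.

```python
import math


def posterior_mean_bernoulli(y, theta_q, sigma2):
    """PME (2) of X in {0, 1} given Y = y under the postulated prior P(X=1) = theta_q."""
    # Posterior log-odds of X=1 versus X=0; working with log-odds avoids overflow.
    t = math.log(theta_q / (1.0 - theta_q)) + (2.0 * y - 1.0) / (2.0 * sigma2)
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def gaussian_pdf(y, mean, sigma2):
    return math.exp(-(y - mean) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def psi(theta_p, theta_q, sigma2, n=4000):
    """MSE (3) of the PME with prior theta_q when the true prior is theta_p."""
    sigma = math.sqrt(sigma2)
    lo, hi = -10.0 * sigma, 1.0 + 10.0 * sigma  # truncation range (assumption)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        y = lo + k * h
        xhat = posterior_mean_bernoulli(y, theta_q, sigma2)
        # Squared error averaged over X=1 (weight theta_p) and X=0 (weight 1-theta_p).
        integrand = (theta_p * (xhat - 1.0) ** 2 * gaussian_pdf(y, 1.0, sigma2)
                     + (1.0 - theta_p) * xhat ** 2 * gaussian_pdf(y, 0.0, sigma2))
        total += integrand if 0 < k < n else 0.5 * integrand  # trapezoidal rule
    return total * h


def emse_scalar(theta_p, theta_q, sigma2):
    """Scalar EMSE (4): mismatched MSE minus MMSE at the same noise variance."""
    return psi(theta_p, theta_q, sigma2) - psi(theta_p, theta_p, sigma2)


mmse = psi(0.1, 0.1, 0.34)
excess = emse_scalar(0.1, 0.2, 0.34)
print(mmse, excess)
```

The parameter values in the last lines are borrowed from Example 1 purely for illustration; any other prior with a computable posterior mean could be substituted in `posterior_mean_bernoulli`.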

Verdú [1] proved that $\mathrm{EMSE}_s$ is related to the relative
entropy [8] as follows:

$$\mathrm{EMSE}_s(\sigma^2) = 2\,\frac{d}{d\gamma}\, D\!\left(P_X * \mathcal{N}(0, 1/\gamma) \,\middle\|\, Q_X * \mathcal{N}(0, 1/\gamma)\right),$$

where $*$ represents convolution, $\gamma = 1/\sigma^2$, $P_X$ and $Q_X$
are the true and mismatched probability distributions, respectively,
and under the assumption that $P$ is absolutely
continuous with respect to $Q$, the relative entropy, also known
as Kullback-Leibler divergence [8], is defined as
$D(P\|Q) = \int \log\left(\frac{dP}{dQ}\right) dP$.

Large linear system: Consider a linear system:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{z}, \tag{5}$$

where the input signal $\mathbf{x} \in \mathbb{R}^N$ is a sequence of independent
and identically distributed (i.i.d.) random variables generated
by some pdf $p_X$, $\mathbf{A} \in \mathbb{R}^{M \times N}$ is an i.i.d. Gaussian random
matrix with $A_{ij} \sim \mathcal{N}(0, 1/M)$, and $\mathbf{z} \in \mathbb{R}^M$ is AWGN with
mean zero and variance $\sigma_z^2$. Large linear systems [2-7], which
are sometimes called the large system limit of linear systems,
refer to the limit that both $N$ and $M$ tend to infinity but with
their ratio converging to a positive number, i.e., $N \to \infty$ and
$M/N \to \delta$. We call $\delta$ the measurement rate of the linear
system.
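For readers who want to experiment with (5), a small instance can be simulated as sketched below. The Bernoulli input, the dimensions, the seed, and the helper name `generate_linear_system` are all assumptions chosen for illustration; the matrix entries are drawn with variance $1/M$, so that each column has approximately unit norm.

```python
import math
import random


def generate_linear_system(n, m, theta, sigma_z2, seed=0):
    """Draw (A, x, y) with y = A x + z for a Bernoulli(theta) input (illustrative choice)."""
    rng = random.Random(seed)
    # i.i.d. input signal with P(x(j) = 1) = theta.
    x = [1.0 if rng.random() < theta else 0.0 for _ in range(n)]
    # i.i.d. Gaussian matrix with zero-mean entries of variance 1/m.
    std_a = math.sqrt(1.0 / m)
    a = [[rng.gauss(0.0, std_a) for _ in range(n)] for _ in range(m)]
    # AWGN with variance sigma_z2 added to each measurement.
    std_z = math.sqrt(sigma_z2)
    y = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + rng.gauss(0.0, std_z)
         for row in a]
    return a, x, y


a, x, y = generate_linear_system(n=500, m=100, theta=0.1, sigma_z2=0.03)
delta = len(y) / len(x)  # measurement rate M/N
print(delta)
```

A finite instance like this only approximates the large system limit; the decoupling results discussed next concern $N \to \infty$ with $M/N \to \delta$.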

Decoupling principle: The decoupling principle [3] is
based on replica analysis of randomly spread code-division
multiple access (CDMA) and multiuser detection. It claims
the following result. Define the PME with prior $q_X$ of the
linear system (5) as

$$\hat{\mathbf{x}}_q = E_{q_X}[\mathbf{x} \mid \mathbf{y}]. \tag{6}$$

Denote the $j$-th element of the sequence $\mathbf{x}$ by $x(j)$. For any
$j \in \{1, 2, \ldots, N\}$, the joint distribution of $(x(j), \hat{x}_q(j))$
converges in probability to the joint distribution of $(X, \hat{X}_q(\cdot\,;\sigma_q^2))$
in large linear systems, where $X \sim p_X$ and $\sigma_q^2$ is the solution
to the following fixed point equation [2, 3],

$$\delta \cdot (\sigma_q^2 - \sigma_z^2) = \Psi_q(\sigma_q^2). \tag{7}$$

Note that $\hat{X}_q(\cdot\,;\sigma^2)$ is defined in (2), and there may be multiple
fixed points [3,9,10].
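One simple way to locate a solution of (7) numerically is to iterate $\sigma^2 \leftarrow \sigma_z^2 + \Psi(\sigma^2)/\delta$ until it stops changing. The sketch below does this for a user-supplied MSE function. To keep it verifiable, it is demonstrated with a matched standard Gaussian prior, whose MMSE in AWGN has the closed form $\sigma^2/(1+\sigma^2)$; that prior, the function names, and the starting point are assumptions of this sketch and do not appear in the examples of this paper.

```python
def solve_fixed_point(psi, delta, sigma_z2, sigma2_init, tol=1e-12, max_iter=10000):
    """Iterate sigma2 <- sigma_z2 + psi(sigma2) / delta until successive values agree."""
    sigma2 = sigma2_init
    for _ in range(max_iter):
        new_sigma2 = sigma_z2 + psi(sigma2) / delta
        if abs(new_sigma2 - sigma2) < tol:
            return new_sigma2
        sigma2 = new_sigma2
    raise RuntimeError("fixed point iteration did not converge")


def mmse_gaussian(sigma2):
    """MMSE of X ~ N(0, 1) observed in AWGN of variance sigma2 (illustrative prior)."""
    return sigma2 / (1.0 + sigma2)


delta, sigma_z2 = 0.2, 0.03
# Start from the noise level seen when the estimate is all zeros (E[X^2] = 1 here).
sigma2_star = solve_fixed_point(mmse_gaussian, delta, sigma_z2,
                                sigma2_init=sigma_z2 + 1.0 / delta)
residual = delta * (sigma2_star - sigma_z2) - mmse_gaussian(sigma2_star)
print(sigma2_star, residual)
```

When (7) has several solutions, which one this iteration reaches depends on `sigma2_init`, so the result should be read as one fixed point rather than the fixed point.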

The decoupling principle provides a single letter characterization
of the MSE; state evolution [11] in approximate
message passing (AMP) [6] provides rigorous justification for
the achievable part of this characterization. An extension of
the decoupling principle to a collection of any finite number
of elements in $\mathbf{x}$ is provided by Guo et al. [5].



Illustration: Figure 1 highlights our contribution, which
is stated in Claim 1, using an example that compares the
EMSE in scalar channels and large linear systems. The solid
line with slope $\delta$ represents the linear function of $\sigma^2$ on
the left-hand side of (7). The dashed and dash-dotted curves
represent $\Psi_p(\sigma^2)$ and $\Psi_q(\sigma^2)$, respectively. The intersection
point $a$ provides the solution to the fixed point equation
(7) when $\hat{X}_p$ is applied, and so the Cartesian coordinate
representation of $a$ is $a = (\sigma_p^2, \Psi_p(\sigma_p^2))$. Similarly, we have
$b = (\sigma_q^2, \Psi_q(\sigma_q^2))$. Therefore, the vertical distance between $a$
and $b$ is the difference in MSE of the two decoupled channels
for the PME with the true prior and the mismatched prior,
respectively. Based on the decoupling principle, the vertical
distance is equivalent to the EMSE in large linear systems,
which we denote by $\mathrm{EMSE}_l$ and define as

$$\mathrm{EMSE}_l = \lim_{N \to \infty} \frac{1}{N} E_{p_X}\left[\|\hat{\mathbf{x}}_q - \mathbf{x}\|_2^2 - \|\hat{\mathbf{x}}_p - \mathbf{x}\|_2^2\right], \tag{8}$$

where the subscript $l$ represents large linear systems, $\hat{\mathbf{x}}_q$
is defined in (6), and $\hat{\mathbf{x}}_p$ is defined similarly. The pair
$c = (\sigma_p^2, \Psi_q(\sigma_p^2))$ represents the MSE achieved by $\hat{X}_q$ in
the decoupled scalar channel for $\hat{X}_p$, and so the vertical
distance between $a$ and $c$ is $\mathrm{EMSE}_s(\sigma_p^2)$. We can see from
Figure 1 that $\mathrm{EMSE}_l$ is larger than $\mathrm{EMSE}_s(\sigma_p^2)$, despite using
the same mismatched prior for PME, because the decoupled
scalar channel becomes noisier as indicated by the horizontal
move from $c$ to $b$; we call this an amplification effect. Our
contribution is to formalize the amplification of $\mathrm{EMSE}_s(\sigma_p^2)$
to $\mathrm{EMSE}_l$.

The remainder of the paper is organized as follows. We 
derive the relationship between the EMSE in large linear 
systems and EMSE in scalar channels in Section II; closed 
form approximations that characterize this relationship are also 
provided. Numerical examples that evaluate the accuracy level 
of our closed form approximations are presented in Section III, 
and Section IV concludes the paper. 


Figure 1: Mismatched estimation for Bernoulli-Gaussian input
signal ($p_X(x) = (1-\theta)\delta_0(x) + \theta\mathcal{N}(0,1)$) in large linear
systems. We notice that the EMSE in the scalar estimation
problem is amplified in the large linear system, despite using
the same mismatched prior for PME, because the decoupled
scalar channel becomes noisier as indicated by the horizontal
move from c to b. ($\delta = 0.2$, $\sigma_z^2 = 0.03$, the true sparsity
parameter $\theta = 0.1$, and the mismatched sparsity parameter
$\tilde{\theta} = 0.2$.)


II. Main Results 

We now characterize the relationship between $\mathrm{EMSE}_l$ and
$\mathrm{EMSE}_s(\sigma_p^2)$, which is the EMSE of the mismatched PME in
the decoupled scalar AWGN channel for the PME with the
true prior $p_X$.

Our main result is summarized in the following claim. We 
call this result a claim, because it relies on the decoupling 
principle [3], which is based on the replica method and lacks 
rigorous justification. 

Claim 1. Consider a large linear system (5). Let the EMSE in
scalar channels $\mathrm{EMSE}_s(\sigma^2)$ and the EMSE in large linear systems
$\mathrm{EMSE}_l$ be defined in (4) and (8), respectively, and let $\sigma_p^2$ be the
noise variance in the decoupled scalar channel when the true
prior $p_X$ is applied. Denote $\Psi_q'(\sigma^2) = \frac{d}{d(\sigma^2)}\Psi_q(\sigma^2)$, where
$\Psi_q(\sigma^2)$ is the MSE in the scalar AWGN channel with noise
variance $\sigma^2$ achieved by the PME with a mismatched prior
$q_X$ (3). In large linear systems, $\mathrm{EMSE}_l$ and $\mathrm{EMSE}_s(\sigma_p^2)$ satisfy
the following relation:

$$\mathrm{EMSE}_l = \mathrm{EMSE}_s(\sigma_p^2) + \int_{\sigma_p^2}^{\sigma_p^2 + \frac{1}{\delta}\mathrm{EMSE}_l} \Psi_q'(\sigma^2)\, d\sigma^2. \tag{9}$$


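Justification: The fixed point equations (7) when applying
$\hat{X}_p$ and $\hat{X}_q$ are

$$\delta \cdot (\sigma_p^2 - \sigma_z^2) = \Psi_p(\sigma_p^2) \quad \text{and} \tag{10}$$

$$\delta \cdot (\sigma_q^2 - \sigma_z^2) = \Psi_q(\sigma_q^2), \tag{11}$$

respectively. Subtract (10) from (11):

$$\delta \cdot (\sigma_q^2 - \sigma_p^2) = \Psi_q(\sigma_q^2) - \Psi_p(\sigma_p^2). \tag{12}$$

Recall that

$$\mathrm{EMSE}_l = \Psi_q(\sigma_q^2) - \Psi_p(\sigma_p^2), \tag{13}$$

$$\mathrm{EMSE}_s(\sigma_p^2) = \Psi_q(\sigma_p^2) - \Psi_p(\sigma_p^2).$$

Combining the two equations:

$$\begin{aligned}
\mathrm{EMSE}_l &= \mathrm{EMSE}_s(\sigma_p^2) + \Psi_q(\sigma_q^2) - \Psi_q(\sigma_p^2) \\
&= \mathrm{EMSE}_s(\sigma_p^2) + \int_{\sigma_p^2}^{\sigma_q^2} \Psi_q'(\sigma^2)\, d\sigma^2,
\end{aligned} \tag{14}$$

where (9) follows by noticing from (12) and (13) that
$\sigma_q^2 - \sigma_p^2 = \frac{1}{\delta}\mathrm{EMSE}_l$. □

Approximations: Consider a Taylor expansion of $\Psi_q(\sigma^2)$
at $\sigma_p^2$:

$$\begin{aligned}
\Psi_q(\sigma_q^2) &= \Psi_q(\sigma_p^2) + \alpha(\sigma_q^2 - \sigma_p^2) + \frac{1}{2}\beta(\sigma_q^2 - \sigma_p^2)^2 + o\left((\sigma_q^2 - \sigma_p^2)^2\right) \\
&= \Psi_q(\sigma_p^2) + \alpha\frac{\mathrm{EMSE}_l}{\delta} + \frac{\beta}{2}\left(\frac{\mathrm{EMSE}_l}{\delta}\right)^2 + o(\Delta^2),
\end{aligned} \tag{15}$$

where $\alpha = \Psi_q'(\sigma_p^2)$ and $\beta = \Psi_q''(\sigma_p^2)$ are the first and second
derivatives, respectively, of $\Psi_q(\cdot)$ evaluated at $\sigma_p^2$, and
$\Delta = \sigma_q^2 - \sigma_p^2$.

Plug (15) into (14):

$$\mathrm{EMSE}_l = \mathrm{EMSE}_s(\sigma_p^2) + \alpha\frac{\mathrm{EMSE}_l}{\delta} + \frac{\beta}{2}\left(\frac{\mathrm{EMSE}_l}{\delta}\right)^2 + o(\Delta^2). \tag{16}$$

If we only keep the first order terms in (16) and solve
for $\mathrm{EMSE}_l$, then the first order approximation of $\mathrm{EMSE}_l$ is
obtained in closed form:

$$\mathrm{EMSE}_l = \frac{\delta}{\delta - \alpha}\,\mathrm{EMSE}_s(\sigma_p^2) + o(\Delta). \tag{17}$$

Note that $\frac{\delta}{\delta - \alpha} > 1$. That is, the EMSE in the scalar estimation
problem is amplified in large linear systems despite using
the same mismatched prior $q_X$ for PME. This is due to an
increased noise variance in the decoupled channel for the
mismatched prior beyond the variance for the correct prior.

1 $h(\Delta) = o(g(\Delta))$ if $\lim_{\Delta \to 0} \frac{h(\Delta)}{g(\Delta)} = 0$.

TABLE I: Relative error in the Bernoulli example

$\tilde{\theta}$   $\Delta$   1st (17)   2nd (18)    2nd (19)
0.11               0.0008     0.13%      <0.0005%    <0.0001%
0.13               0.0070     1%         0.041%      0.017%
0.15               0.0178     2.8%       0.28%       0.11%
0.17               0.0324     5.2%       0.99%       0.35%
0.20               0.0603     10%        4%          1%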


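Similarly, the second order approximation is given by

$$\mathrm{EMSE}_l = \frac{\delta\,\mathrm{EMSE}_s(\sigma_p^2)}{\delta - \alpha}\left(1 + \frac{1}{2}\,\frac{\beta\,\mathrm{EMSE}_s(\sigma_p^2)}{(\delta - \alpha)^2}\right) + o(\Delta^2), \tag{18}$$

under the condition that $\Psi_q(\sigma^2)$ is locally concave in $\sigma^2$
around $\sigma_p^2$; details in the Appendix. A more accurate approximation
is also presented in the Appendix in (19).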

We expect that when the mismatched distribution $q_X$ is
close to $p_X$, $\Psi_p(\sigma^2)$ and $\Psi_q(\sigma^2)$ are close to each other for
all $\sigma^2$, and $\Delta = \sigma_q^2 - \sigma_p^2$ is smaller for minor mismatch than
for significant mismatch with the same slope of $\Psi_q(\sigma^2)$ at $\sigma_p^2$ and
the same $\delta$. Therefore, the first order approximation of $\Psi_q(\sigma^2)$
for $\sigma^2 \in [\sigma_p^2, \sigma_p^2 + \Delta]$ is more likely to be reasonably accurate
for minor mismatch. When the mismatch is significant, we
might need to include the second order term in the Taylor
expansion (15) to improve accuracy. Numerical examples that
show the necessity of the second order term when there is
significant mismatch are given in Section III.
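The three closed form approximations (17), (18), and the Appendix expression (19) are simple enough to evaluate directly once $\mathrm{EMSE}_s(\sigma_p^2)$, $\alpha$, $\beta$, and $\delta$ are known. The following sketch implements them; the function names and the numerical inputs at the bottom are hypothetical values chosen only to exercise the formulas, not values taken from the examples in this paper.

```python
import math


def emse_first_order(emse_s, alpha, delta):
    """First order approximation (17)."""
    return delta / (delta - alpha) * emse_s


def emse_second_order_18(emse_s, alpha, beta, delta):
    """Second order approximation (18)."""
    correction = 1.0 + 0.5 * beta * emse_s / (delta - alpha) ** 2
    return emse_first_order(emse_s, alpha, delta) * correction


def emse_second_order_19(emse_s, alpha, beta, delta):
    """Second order approximation (19): positive root of the quadratic in (16)."""
    if beta >= 0:
        raise ValueError("(19) is derived under local concavity, i.e. beta < 0")
    gap = delta - alpha
    return delta / beta * (gap - math.sqrt(gap * gap - 2.0 * beta * emse_s))


# Hypothetical inputs for illustration only.
emse_s, alpha, beta, delta = 0.001, 0.1, -0.2, 0.2
e1 = emse_first_order(emse_s, alpha, delta)
e18 = emse_second_order_18(emse_s, alpha, beta, delta)
e19 = emse_second_order_19(emse_s, alpha, beta, delta)
print(e1, e18, e19)
```

Since (19) solves the quadratic in (16) exactly while (18) expands its square root, the two second order values should agree closely whenever $\mathrm{EMSE}_s(\sigma_p^2)$ is small relative to $(\delta - \alpha)^2/|\beta|$.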


III. Numerical Examples 
A. Accuracy of approximations 

We begin with two examples that examine the accuracy level 
of our first and second order approximations given by (17)- 
(19). 

Example 1: Bernoulli input. The input signal of the first
example follows a Bernoulli distribution with $p_X(1) = \theta$
and $p_X(0) = 1 - \theta$. Let $\theta = 0.1$, and let the mismatched
Bernoulli parameter $\tilde{\theta}$ vary from 0.11 (minor mismatch)
to 0.20 (significant mismatch). The linear system (5) has
measurement rate $\delta = 0.2$ and measurement noise variance
$\sigma_z^2 = 0.03$. Using (7), this linear system with the true
Bernoulli prior yields $\sigma_p^2 = 0.34$. Table I shows the accuracy
level of our three approximations (17), (18) and (19) for
the Bernoulli example. The relative error of the predicted
$\mathrm{EMSE}_l$, which is denoted by $\mathrm{EMSE}_l(\mathrm{pred})$, is defined as
$|\mathrm{EMSE}_l - \mathrm{EMSE}_l(\mathrm{pred})|/\mathrm{EMSE}_l$, where the first and second
order approximations for $\mathrm{EMSE}_l(\mathrm{pred})$ are given by (17)-(19),
respectively.
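The quantities in this example can be explored with a short script. The sketch below solves (7) by bisection for the true and the mismatched Bernoulli priors, then forms $\mathrm{EMSE}_s(\sigma_p^2)$ and $\mathrm{EMSE}_l$ as in (13). It is not the code used to produce Table I: the function names, the trapezoidal integration, and the bisection bracket (which assumes a single sign change of $\delta(\sigma^2 - \sigma_z^2) - \Psi(\sigma^2)$ inside it) are assumptions of this sketch.

```python
import math


def pme(y, theta_q, sigma2):
    """Posterior mean of X in {0, 1} under the postulated prior P(X=1) = theta_q."""
    t = math.log(theta_q / (1.0 - theta_q)) + (2.0 * y - 1.0) / (2.0 * sigma2)
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def psi(theta_p, theta_q, sigma2, n=2000):
    """MSE (3) by trapezoidal integration over y (grid and range are assumptions)."""
    s = math.sqrt(sigma2)
    lo, hi = -10.0 * s, 1.0 + 10.0 * s
    h = (hi - lo) / n
    c = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    total = 0.0
    for k in range(n + 1):
        y = lo + k * h
        xh = pme(y, theta_q, sigma2)
        f = c * (theta_p * (xh - 1.0) ** 2 * math.exp(-(y - 1.0) ** 2 / (2.0 * sigma2))
                 + (1.0 - theta_p) * xh ** 2 * math.exp(-y * y / (2.0 * sigma2)))
        total += f if 0 < k < n else 0.5 * f
    return total * h


def fixed_point(theta_p, theta_q, delta, sigma_z2, iters=60):
    """Bisection on delta * (s - sigma_z2) - psi(s); the MSE is at most 1 for X in {0, 1}."""
    lo, hi = sigma_z2, sigma_z2 + 1.0 / delta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delta * (mid - sigma_z2) < psi(theta_p, theta_q, mid):
            lo = mid  # line still below the MSE curve: root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


theta, theta_mis, delta, sigma_z2 = 0.1, 0.2, 0.2, 0.03
sigma2_p = fixed_point(theta, theta, delta, sigma_z2)
sigma2_q = fixed_point(theta, theta_mis, delta, sigma_z2)
emse_s = psi(theta, theta_mis, sigma2_p) - psi(theta, theta, sigma2_p)
emse_l = psi(theta, theta_mis, sigma2_q) - psi(theta, theta, sigma2_p)
gap = sigma2_q - sigma2_p  # Delta in the notation of Section II
print(sigma2_p, sigma2_q, emse_s, emse_l, gap)
```

By (12) and (13), `delta * gap` and `emse_l` should coincide up to numerical error, which gives a quick consistency check on the two fixed points.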

Example 2: Bernoulli-Gaussian input. Here the input
signal follows a Bernoulli-Gaussian distribution $p_X(x) =
\theta\mathcal{N}(0,1) + (1-\theta)\delta_0(x)$, where $\delta_0(\cdot)$ is the delta function [12].
Let $\theta = 0.1$, and let the mismatched parameter $\tilde{\theta}$ vary from
0.11 to 0.20 as before. The linear system is the same as in
the Bernoulli example with $\delta = 0.2$ and $\sigma_z^2 = 0.03$, and
this linear system with the correct Bernoulli-Gaussian prior
yields $\sigma_p^2 = 0.27$. Figure 1 compares the PME with the
true parameter and the PME with the mismatched parameter
$\tilde{\theta} = 0.2$. The accuracy level of our approximations (17)-(19)
for the Bernoulli-Gaussian example is shown in Table II.

TABLE II: Relative error in the Bernoulli-Gaussian example

$\tilde{\theta}$   $\Delta$   1st (17)   2nd (18)   2nd (19)
0.11               0.0006     0.25%      0.09%      0.09%
0.13               0.0052     1.4%       0.04%      0.08%
0.15               0.0132     3.5%       0.21%      0.03%
0.17               0.0240     6.5%       0.98%      0.08%
0.20               0.0444     13%        4%         0.4%

It can be seen from Tables I and II that when the mismatch 
is minor, the first order approximation can achieve relative 
error below 1%. However, as the mismatch increases, we may 
need to include the second order term to reduce error. 

B. Amplification for non-i.i.d. signals 

To show that the results of this paper are truly useful, it 
would be interesting to evaluate whether the amplification 
effect can be used to characterize the performance of more 
complicated problems that feature mismatch. By their nature, 
complicated problems may be difficult to characterize in closed 
form, not to mention that the theory underlying them may not 
be well-understood. Therefore, we will pursue a setting where 
a large linear system is solved for non-i.i.d. signals. Note that 
neither state evolution [11] nor the decoupling principle [3] 
has been developed for such non-i.i.d. settings, and so any 
predictive ability would be heuristic. We now provide such an 
example, and demonstrate that we can predict the MSE of one 
ad-hoc algorithm from that of another. 

Example 3: Markov-constant input. Our Markov-constant
signal is generated by a two-state Markov state machine
that contains states $s_0$ (zero state) and $s_1$ (nonzero state).
The signal values in states $s_0$ and $s_1$ are 0 and 1, respectively.
Our transition probabilities are $p(s_0|s_1) = 0.2$ and
$p(s_1|s_0) = 1/45$, which yield 10% nonzeros in the Markov-constant
signal. In words, this is a block sparse signal whose
entries typically stay “on” with value 1 for roughly 5 entries, and
then remain “off” for roughly 45 entries. The optimal PME for
this non-i.i.d. signal requires the posterior to be conditioned
on the entire observed sequence, which seems computationally
intractable. Instead, we consider two sub-optimal yet practical
estimators where the conditioning is limited to signal entries
within windows of sizes two or three. We call these practical
approaches Bayesian sliding window denoisers:

$$\hat{x}_2(i) = E\left[x(i) \mid (y(i-1), y(i))\right],$$

$$\hat{x}_3(i) = E\left[x(i) \mid (y(i-1), y(i), y(i+1))\right].$$

Increasing the window-size would capture more statistical
information about the signal, resulting in better MSE performance.
Unfortunately, a larger window-size requires more
computation.
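A minimal sketch of such a sliding window denoiser is given below. It assumes that each observation in the window is the corresponding signal entry plus independent Gaussian noise of variance `sigma2`, and that the first entry in the window is drawn from the stationary distribution of the Markov chain; both assumptions, along with all names in the code, are choices made for this illustration rather than details taken from the paper. The posterior is computed by enumerating every state sequence in the window, which is feasible only because the window is short.

```python
import itertools
import math

P_ON_GIVEN_OFF = 1.0 / 45.0  # p(s1|s0)
P_OFF_GIVEN_ON = 0.2         # p(s0|s1)
# TRANS[(a, b)] is the probability of moving from state a to state b.
TRANS = {
    (0, 0): 1.0 - P_ON_GIVEN_OFF, (0, 1): P_ON_GIVEN_OFF,
    (1, 0): P_OFF_GIVEN_ON, (1, 1): 1.0 - P_OFF_GIVEN_ON,
}
# Stationary probability of the nonzero state.
PI_ON = P_ON_GIVEN_OFF / (P_ON_GIVEN_OFF + P_OFF_GIVEN_ON)
STATIONARY = {0: 1.0 - PI_ON, 1: PI_ON}


def sliding_window_pme(ys, center, sigma2):
    """E[x(center) | ys] for a window of noisy observations ys of the 0/1 Markov signal."""
    log_weights, states = [], []
    for xs in itertools.product((0, 1), repeat=len(ys)):
        # Log prior of the state sequence under the Markov chain.
        lp = math.log(STATIONARY[xs[0]])
        for a, b in zip(xs, xs[1:]):
            lp += math.log(TRANS[(a, b)])
        # Gaussian log-likelihood of the window (constants cancel in the ratio).
        for x, y in zip(xs, ys):
            lp -= (y - x) ** 2 / (2.0 * sigma2)
        log_weights.append(lp)
        states.append(xs)
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - top) for lw in log_weights]
    on_mass = sum(w for w, xs in zip(weights, states) if xs[center] == 1)
    return on_mass / sum(weights)


# Window of size two: (y(i-1), y(i)), estimating x(i) at index 1.
x2_after_on = sliding_window_pme((1.0, 1.0), center=1, sigma2=0.1)
x2_after_off = sliding_window_pme((0.0, 1.0), center=1, sigma2=0.1)
# Window of size three: (y(i-1), y(i), y(i+1)), estimating x(i) at index 1.
x3 = sliding_window_pme((1.0, 1.0, 1.0), center=1, sigma2=0.1)
print(PI_ON, x2_after_on, x2_after_off, x3)
```

The enumeration visits $2^k$ state sequences for a window of size $k$, which makes the computational cost of larger windows concrete.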

To decide whether a fast and simple yet less accurate
denoiser should be utilized instead of a more accurate yet
slower denoiser while still achieving an acceptable MSE, we
must predict the difference in MSE performance between the
two. Such a prediction is illustrated in Figure 2. The dash-dotted
curve represents the MSE achieved by AMP using the
better denoiser $\hat{x}_3$ as a function of the measurement rate $\delta$,
where for each $\delta$, we obtain a solution to the fixed point
equation (7), which is denoted by $\sigma^2(\delta)$. The dashed curve
represents the MSE achieved by $\hat{x}_2$ in scalar channels with
noise variance $\sigma^2(\delta)$. Note, however, that $\hat{x}_2$ uses window-size
2 whereas $\hat{x}_3$ uses window-size 3 and can gather more
information about the signal. Therefore, $\hat{x}_2$ obtains higher
(worse) MSE despite denoising a statistically identical scalar
channel. The vertical distance between the dashed and dash-dotted
curves is $\mathrm{EMSE}_s$.

What MSE performance can be obtained by using the faster
denoiser $\hat{x}_2$ within AMP? The predicted MSE performance
using $\hat{x}_2$ within AMP is depicted with crosses; the prediction
relies on our second order approximation (19). That is, the
vertical distance between the crosses and dash-dotted curve is
$\mathrm{EMSE}_l$ (19). Finally, the solid curve is the true MSE achieved
by AMP with $\hat{x}_2$. The reader can see that the predicted MSE
performance may help the user decide which denoiser to
employ within AMP.

Whether the successful predictive ability in this example is 
a coincidence or perhaps connected to a theory of decoupling 
for non-i.i.d. signals remains to be seen. Nevertheless, this 
example indicates that the formulation (9) that relates the 
mismatched estimation in scalar channels to that in large 
linear systems, as well as its closed form approximations (17), 
(18), and (19) can already be applied to far more complicated 
systems for practical uses. 

IV. Conclusion 

We studied the excess mean square error (EMSE) above 
the minimum mean square error (MMSE) in large linear 
systems due to the mismatched prior for the posterior mean 
estimator (PME). In particular, we derived the relationship 
between the EMSE in large linear systems and that in scalar 
channels; three simple approximations to this relationship were 
provided. Numerical examples show that our approximations 
are accurate, indicating that they can be used to predict the 
EMSE in large linear systems due to mismatched estimation. 

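Appendix: Second Order Approximation of $\mathrm{EMSE}_l$

Solving (16) for $\mathrm{EMSE}_l$:

$$\mathrm{EMSE}_l = \frac{\delta}{\beta}\left((\delta - \alpha) \pm \sqrt{(\delta - \alpha)^2 - 2\beta\,\mathrm{EMSE}_s(\sigma_p^2)}\right) + o(\Delta^2).$$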



Figure 2: Prediction of MSE for Markov-constant input signal
($p(s_1|s_0) = 1/45$ and $p(s_0|s_1) = 0.2$) in large linear systems.
In the legend, $\sigma^2(\delta)$ represents the noise variance that AMP
with $\hat{x}_3$ converges to at measurement rate $\delta$. ($\sigma_z^2 = 0.1$.)


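Recall that $\sigma_p^2$ is the noise variance in the decoupled
scalar channel for the MMSE estimator, and $\Psi_p(\sigma_p^2)$ is the
MMSE. That is, $\sigma_q^2 > \sigma_p^2$ and $\Psi_q(\sigma_q^2) > \Psi_p(\sigma_p^2)$, for all
$q_X \neq p_X$. Hence, $\Delta = \sigma_q^2 - \sigma_p^2 > 0$ and $\mathrm{EMSE}_s(\sigma_p^2) =
\Psi_q(\sigma_p^2) - \Psi_p(\sigma_p^2) > 0$. Under the condition that $\Psi_q(\sigma^2)$ is
locally concave around $\sigma_p^2$, the second derivative of $\Psi_q(\sigma^2)$
at $\sigma_p^2$ is negative. That is, $\beta < 0$. Therefore, $(\delta - \alpha)^2 -
2\beta\,\mathrm{EMSE}_s(\sigma_p^2) > (\delta - \alpha)^2$, so only the root with the minus
sign yields a positive $\mathrm{EMSE}_l$, and

$$\begin{aligned}
\mathrm{EMSE}_l &= \frac{\delta}{\beta}\left((\delta - \alpha) - \sqrt{(\delta - \alpha)^2 - 2\beta\,\mathrm{EMSE}_s(\sigma_p^2)}\right) + o(\Delta^2) \\
&= \frac{\delta}{\beta}(\delta - \alpha)\left(1 - \sqrt{1 - \frac{2\beta\,\mathrm{EMSE}_s(\sigma_p^2)}{(\delta - \alpha)^2}}\right) + o(\Delta^2).
\end{aligned} \tag{19}$$

A Taylor expansion of $\sqrt{1 + x}$ yields

$$\sqrt{1 + x} = 1 + \frac{1}{2}x - \frac{1}{8}x^2 + o(x^2).$$

Let $x = -\frac{2\beta\,\mathrm{EMSE}_s(\sigma_p^2)}{(\delta - \alpha)^2}$, then

$$\sqrt{1 - \frac{2\beta\,\mathrm{EMSE}_s(\sigma_p^2)}{(\delta - \alpha)^2}} = 1 - \frac{\beta\,\mathrm{EMSE}_s(\sigma_p^2)}{(\delta - \alpha)^2} - \frac{1}{2}\,\frac{\beta^2\,\mathrm{EMSE}_s(\sigma_p^2)^2}{(\delta - \alpha)^4} + o\left(\mathrm{EMSE}_s(\sigma_p^2)^2\right). \tag{20}$$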


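Plugging (20) into (19),

$$\mathrm{EMSE}_l = \frac{\delta\,\mathrm{EMSE}_s(\sigma_p^2)}{\delta - \alpha}\left(1 + \frac{1}{2}\,\frac{\beta\,\mathrm{EMSE}_s(\sigma_p^2)}{(\delta - \alpha)^2}\right) + o\left(\mathrm{EMSE}_s(\sigma_p^2)^2\right) + o(\Delta^2).$$

Note that

$$\frac{\mathrm{EMSE}_s(\sigma_p^2)^2}{\Delta^2} < \frac{\mathrm{EMSE}_l^2}{\Delta^2} = \delta^2,$$

and so,

$$\lim_{\Delta \to 0} \frac{\mathrm{EMSE}_s(\sigma_p^2)^2}{\Delta^2} \le \delta^2 < \infty,$$

which implies that writing $o\left(\mathrm{EMSE}_s(\sigma_p^2)^2\right) + o(\Delta^2)$ is
equivalent to writing $o(\Delta^2)$.

We have proved the desired result (18).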